Abstract. We studied the knowledge gap in GenBank with regard to the ca. 600 anuran species from Amazonia. The markers 12S, 16S, COI and cytb were examined, for which information was available for about half of all species. Both the number of sample sites and the number of samples per species varied greatly (best studied in 16S, with means of 4.85 ± 10.37 and 11.19 ± 31.20, respectively), and merely one fifth of all species had at least 5 sample sites. This suggests that a considerable portion of species is underrepresented in GenBank. Representativeness is especially difficult to assess in widespread species that could well represent complexes of cryptic allopatric species (i.e., with smaller distributions). This is a well-known phenomenon in Amazonian anurans, although truly widespread species do exist. Moreover, limited sampling does not necessarily imply limited representativeness, as numerous species are known to occupy only relatively small, localised to regional ranges. Our study furthermore revealed that, in a geographic context, major portions of Amazonia have as yet been undersampled. That is, the 453 sample sites in total (most with more than one species sampled) are spatially clustered, often in areas with increased anthropogenic activity. We conclude that there is a large knowledge gap in terms of spatial sampling, resulting in taxonomic deficiencies.

Introduction

The Amazon Basin is one of the mega-diversity regions of the globe, and it has an iconic status in biogeographic and evolutionary research (Hoorn et al. 2010, Jenkins et al. 2013). One of the most diverse animal groups in this region is amphibians. Of the more than 7,800 amphibian species known worldwide to date (AmphibiaWeb 2018), several hundred have been recorded from Amazonia, of which most are members of the order Anura (IUCN 2017). Studies aiming at a better understanding of Amazonian anuran diversity have become increasingly available over the last two decades (e.g., Noonan & Gaucher 2005, Santos et al. 2009, Wiens et al. 2011, Duellman et al. 2016, Gehara et al. 2014). These modern approaches basically make use of molecular markers, and authors are generally required to store their data and make them public via online databases. Regarding amphibians, NCBI GenBank (Benson et al. 2015) is broadly used for information storage (e.g., Vences et al. 2005, Vences & Köhler 2006, Vieites et al. 2009, Che et al. 2012).

Because anurans are so diverse in Amazonia and have been attracting ever-greater study interest, they have been proposed as suitable for studying more general research questions with regard to the genesis of Amazonian biota (e.g., Azevedo-Ramos & Galatti 2002, Buckley & Jetz 2007, Zeisset & Beebee 2008, Antonelli et al. 2018). Concerning this potential role as a model group in biogeographic and evolutionary research, one may ask how well the various species are represented in GenBank, especially as more than a decade ago, Latin American amphibians were considered to be ‘under-represented’ in GenBank (Vences & Köhler 2006). Assessing the biogeographic-taxonomic knowledge gap is especially relevant, as current studies already make use of GenBank information (e.g., Pyron & Wiens 2011), and forthcoming ones are expected to do so increasingly.

We assessed Amazonian anuran species in GenBank with a focus on four markers and here provide information on the species included, the number of sample sites, and the number of samples per species. We furthermore analyse the available data in a geographic context.

Methods 

Geographic focus 

There is no universal definition of Amazonia in the literature (Goulding et al. 2003). In our study, we combined 25 global WWF Terrestrial Ecoregions as defined by Olson et al. (2001). This area, 5,959,751 km² in size (Fig. 1), encompasses all piedmont-lowland moist, 'varzea' and rain forest units plus two associated moist savannas within the Amazon and Tocantins river catchments and additionally incorporates parts of the Guiana Shield (Supplementary data 1). Geographical data were obtained from www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world (accessed 25 November 2016) and processed with ESRI ArcGIS 10.2.


GenBank search 

We initially used the IUCN Red List of Threatened Species (IUCN 2017) and the GIS-ready shapefiles available from it (www.iucnredlist.org/technical-documents/spatial-data) to identify anuran species native to Amazonia, as defined above. From these 609 species, we excluded those with less than 20% of their total distribution within Amazonia from further analyses, i.e., we consider them ‘non-Amazonian’ (Supplementary data 2). This left us with 494 species that are partly or entirely distributed in Amazonia. Despite regular updates, the IUCN Red List lags behind the progress in taxonomy. Therefore, we used Frost (2017) to identify another 18 species described from our focal region between 2014 and 2017 and not yet considered by the IUCN (Supplementary data 3).

The combined list of 512 species names was used as an operational tool to run GenBank searches (www.ncbi.nlm.nih.gov/genbank) for sequence availability via the ‘nucleotide search’ function. Because the IUCN Red List lags behind taxonomic progress and taxonomic changes are not carried forward to GenBank at all, we also used both old (synonymous) and most recent names as available from Frost (2017). We targeted four mitochondrial (mt) markers widely used in Neotropical anuran research (e.g., Vences et al. 2005, Fouquet et al. 2007a, Vieites et al. 2009, Che et al. 2012, Gehara et al. 2014, Peloso et al. 2014, Ferrão et al. 2016): the ribosomal RNA subunits 12S and 16S, Cytochrome Oxidase 1 (COI), and Cytochrome b (cytb). Genetic data were available for a total of 348 entries from our operational list of species names (not necessarily for all four markers). Additionally, we recorded the number of samples (individuals) studied for each marker; data are provided in Supplementary data 4. In the process, currently valid names, as available from Frost (2017), were added to all names.

Figure 1. Delimitation of ‘Amazonia’ as a composite of 25 WWF Terrestrial Ecoregions highlighted in grey (Supplementary data 1). Dots represent 774 sample sites (453 of which in Amazonia) of anuran species that have at least 20% distribution overlap with ‘Amazonia’.

Table 1. Quantitative data of four studied markers of Amazonian anurans in GenBank (for details see Supplementary data 4). Means are followed by standard deviations and ranges in parentheses.

| Marker | Number of species | Number of sample sites per species | Number of samples per species |
|---|---|---|---|
| 12S | 254 | 3.40 ± 6.25 (1-54) | 6.26 ± 14.12 (1-123) |
| 16S | 293 | 4.85 ± 10.37 (1-138) | 11.19 ± 31.20 (1-394) |
| COI | 117 | 1.83 ± 10.33 (1-147) | 3.34 ± 23.57 (1-392) |
| cytb | 151 | 1.91 ± 6.72 (1-77) | 4.10 ± 19.80 (1-287) |
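
The search over old and current names for each of the four markers can be illustrated with a small query builder. This is only a sketch under stated assumptions: the marker aliases, the use of the `[Organism]` and `[All Fields]` search tags, and the example synonym pair are illustrative choices, not the search strings actually used in this study.

```python
# Hypothetical marker aliases; the study does not list its exact search strings.
MARKER_ALIASES = {
    "12S": ["12S"],
    "16S": ["16S"],
    "COI": ["COI", "COX1"],
    "cytb": ["cytb", "cytochrome b"],
}


def build_queries(names, marker_aliases=MARKER_ALIASES):
    """Return (name, marker, search term) for every name-by-marker combination.

    `names` should contain both the current name and its synonyms, because
    GenBank records may still be filed under an outdated name.
    """
    queries = []
    for name in names:
        for marker, aliases in marker_aliases.items():
            alias_clause = " OR ".join(f'"{a}"[All Fields]' for a in aliases)
            term = f'"{name}"[Organism] AND ({alias_clause})'
            queries.append((name, marker, term))
    return queries


# Example: a current name together with an older synonym.
example_queries = build_queries(["Boana boans", "Hypsiboas boans"])
for name, marker, term in example_queries:
    print(marker, term)
```

Each resulting term could be pasted into the nucleotide search; hits under any synonym would then be pooled under the currently valid name.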

Geographic sample site allocation 

For 40 of the 348 species names, locality information was lacking, or was given so imprecisely that spatial uncertainty was too great for integration into this study (e.g., referring to an entire river system of several hundred kilometres in length, or an entire country). For the resulting 308 species names (Supplementary data 4), geo-referenced locality data (latitude-longitude) were directly adopted from GenBank entries (listed in Supplementary data 5). Where these were not available, we searched for more precise locality information in the publications referred to in GenBank and georeferenced the localities using Google Earth 7. In this manner, a total of 1,558 geo-referenced records (i.e., sample sites) were obtained for the 308 species names. The elimination of duplicates left 774 unique sample sites (regardless of how many species were recorded from a single site). Of these, 453 (1,107 records) were located within the predefined region Amazonia, whereas the remaining ones were extralimital (Fig. 1).
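
The reduction from species-by-locality records to unique sample sites can be sketched as follows. The records and the rounding precision are hypothetical; the text does not state how near-identical coordinates were treated, so rounding to four decimals is an assumption made only for this illustration.

```python
def unique_sites(records, decimals=4):
    """Collapse (species, lat, lon) records into unique sites.

    Returns a dict mapping each rounded (lat, lon) pair to the set of
    species recorded there, so one site may hold several species.
    """
    sites = {}
    for species, lat, lon in records:
        key = (round(lat, decimals), round(lon, decimals))
        sites.setdefault(key, set()).add(species)
    return sites


# Hypothetical records: two species share one site, one record is a near-duplicate.
records = [
    ("Species A", -3.1000, -60.0200),
    ("Species B", -3.1000, -60.0200),
    ("Species A", -3.10001, -60.02001),
    ("Species A", 4.9000, -52.3000),
]

sites = unique_sites(records)
print(len(records), "records ->", len(sites), "unique sites")
```

Counting the keys gives the number of unique sites, while the sizes of the species sets show how many species were sampled at each site.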

Geographic data analysis 

For an analysis of the resulting point pattern, a multi-distance spatial cluster analysis was performed on the 453 unique sample sites from Amazonia by means of an L function in ArcGIS. The L function is a variance-stabilized derivative of Ripley’s K function (Besag 1977) and is compared against a random point pattern following a Poisson distribution. If the observed function is greater than the function derived from the point pattern generated at random, the focal points (i.e., sample sites) are geographically clustered (Haase 1995).
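
To make the logic of the L function concrete, the following minimal sketch computes L(d) = sqrt(K(d)/π) for a simulated point pattern and compares it with a simulation envelope from random patterns. It is not the ArcGIS implementation: it assumes planar coordinates in a rectangular study area, applies no edge correction, and uses hypothetical simulated points rather than the 453 sample sites.

```python
import math
import random


def l_function(points, distances, area):
    """Return L(d) = sqrt(K(d) / pi) for each d, without edge correction."""
    n = len(points)
    pair_d = [math.dist(points[i], points[j])
              for i in range(n) for j in range(i + 1, n)]
    values = []
    for d in distances:
        # Each unordered pair counts twice (i->j and j->i) in Ripley's K.
        ordered_pairs = 2 * sum(1 for x in pair_d if x <= d)
        k = area * ordered_pairs / (n * (n - 1))
        values.append(math.sqrt(k / math.pi))
    return values


def csr_envelope(n, distances, width, height, n_sim=99, seed=1):
    """Min/max L(d) over n_sim uniformly random patterns of n points."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        pts = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        sims.append(l_function(pts, distances, width * height))
    lower = [min(s[i] for s in sims) for i in range(len(distances))]
    upper = [max(s[i] for s in sims) for i in range(len(distances))]
    return lower, upper


# Hypothetical clustered pattern: 20 points around each of three centres.
rng = random.Random(42)
centres = [(0.2, 0.2), (0.7, 0.3), (0.5, 0.8)]
clustered = [(rng.gauss(cx, 0.02), rng.gauss(cy, 0.02))
             for cx, cy in centres for _ in range(20)]

distances = [0.05, 0.10, 0.15]
observed = l_function(clustered, distances, area=1.0)
lower, upper = csr_envelope(len(clustered), distances, 1.0, 1.0)
is_clustered = [o > u for o, u in zip(observed, upper)]
print(is_clustered)
```

Under complete spatial randomness L(d) is close to d, so an observed curve running above the upper envelope at a given distance indicates clustering at that scale.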

With the goal of explaining the geographic pattern, we assessed the Human Footprint Index (HFI) at sample sites within Amazonia in order to examine whether their spatial distribution was influenced by increased anthropogenic activity. The HFI is based on population density, extents of infrastructure and agriculture, and other landscape features (Sanderson et al. 2002). Grid-based HFI values in the range 0-100 (i.e., from ‘mostly wild’ to ‘high anthropogenic impacts’) are available at a 30 arc-sec resolution from the ‘Last of the Wild’ project (http://sedac.ciesin.columbia.edu/wildareas, last accessed 18 May 2017). Using ArcGIS, we extracted HFI values at sample sites and tested whether their means were significantly different from those of all grid cells with no collection activity within the area previously defined as Amazonia (Mann-Whitney U-test for non-parametric data).
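
The comparison of HFI values between sampled and non-sampled cells can be sketched with a plain-Python Mann-Whitney U-test. This is a minimal illustration using the normal approximation with tie correction and no continuity correction; the HFI values are hypothetical, and the software actually used for the test is not stated in the text.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test (normal approximation, tie-corrected).

    Returns (U statistic for x, z score, two-sided p value).
    """
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n_total = len(combined)
    ranks = [0.0] * n_total
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 / 12 * ((n_total + 1) - tie_term / (n_total * (n_total - 1))))
    z = (u_x - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u_x, z, p


# Hypothetical HFI values (0-100) at sampled sites and at non-sampled cells.
hfi_sampled = [12, 30, 45, 18, 60, 25, 33, 41]
hfi_unsampled = [3, 5, 8, 2, 10, 7, 4, 6]

u_stat, z_score, p_value = mann_whitney_u(hfi_sampled, hfi_unsampled)
print(u_stat, round(z_score, 2), p_value)
```

With real data, the non-sampled group would contain all grid cells of the study region without collection activity, which makes the normal approximation appropriate given the very large sample size.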


Results 

Representation of Amazonian anurans in GenBank data

According to the most recent taxonomy, the 308 species names account for 305 species (Supplementary data 4). Our data search revealed that amongst these, the total number of georeferenced sample sites per species across their entire distributions ranged from 1 to 147 (mean 5.07 ± 11.98). Within Amazonia only, the range was 1 to 74 (mean 3.69 ± 7.57). About one third (i.e., 116) of all species had only 1, more than half (i.e., 170) had ≤ 2, and merely about one fifth (i.e., 59) had ≥ 5 sample sites within Amazonia.
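
The per-species summaries reported here (mean, standard deviation, range, and counts of species at or below a given number of sample sites) can be computed as in the sketch below. The five counts are taken from the ‘Sample sites within Amazonia’ column of Table 2 purely for illustration, so the output does not reproduce the values for all 305 species; the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarise_sites(site_counts):
    """Summarise the number of sample sites per species."""
    counts = list(site_counts.values())
    return {
        "n_species": len(counts),
        "mean": mean(counts),
        "sd": stdev(counts),  # sample standard deviation (assumed)
        "range": (min(counts), max(counts)),
        "only_1": sum(c == 1 for c in counts),
        "at_most_2": sum(c <= 2 for c in counts),
        "at_least_5": sum(c >= 5 for c in counts),
    }


# Sample sites within Amazonia for five species from Table 2 (illustrative subset).
site_counts = {
    "Adenomera andreae": 74,
    "Ameerega hahneli": 11,
    "Atelopus flavescens": 2,
    "Ranitomeya imitator": 5,
    "Leptodactylus fuscus": 8,
}

summary = summarise_sites(site_counts)
print(summary)
```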

Regarding the four individual markers, information was available for 12S and 16S in 82.47 and 95.13% of all 308 species names, respectively; it was available for considerably fewer names in the other two markers (Table 1). Mean values of all markers were notably low (with a high standard deviation) in terms of both the number of sample sites and the number of samples. Sampling effort was high in a few species, however; this resulted in high upper ranges in these two parameters, which were especially high in 16S and COI (Table 1). Table 2 provides an overview of the 20 best-studied species over all four markers. For more comprehensive information see Supplementary data 4.

Examination of geographic sampling effort 

As is illustrated in Fig. 1, sample sites are unevenly distributed both within Amazonia and beyond. In accordance with this pattern, the L function analysis indicated that sampling was significantly inhomogeneous across geographic space (Fig. 2). Within the study region, some clusters are obvious in Fig. 1, including for instance: the upper Amazon Basin in Ecuador and parts of Peru; the area around and south of Manaus (Brazil); the Guyana-Venezuela border area; and French Guiana. Proportionally well sampled are parts of some major Amazonian rivers, most notably the Rio Madeira. In accordance with these findings, the HFI for Amazonia showed higher human influence at the sample sites (mean 15.85 ± 15.01) than in the non-sampled area (mean 7.56 ± 8.04; Fig. 3). The difference was highly significant at P < 0.001. On the other hand, huge areas of Amazonia are extremely poorly sampled, most notably a vast patch comprising eastern Colombia, western Brazil north of the Amazon River and western Venezuela.

Table 2. The 20 best sampled Amazonian anuran species in GenBank in alphabetical order. For each of the four genes, the 10 species with the highest number of sample sites in Supplementary data 4 were selected and the resulting species names pooled.

| Species | Sample sites within Amazonia | Sample sites 12S | Samples 12S | Sample sites 16S | Samples 16S | Sample sites COI | Samples COI | Sample sites cytb | Samples cytb |
|---|---|---|---|---|---|---|---|---|---|
| Adenomera andreae | 74 | 36 | 91 | 36 | 88 | 77 | 77 | 77 | 287 |
| Adenomera hylaedactyla | 43 | 13 | 27 | 16 | 27 | 69 | 79 | 69 | 76 |
| Allobates femoralis | 71 | 44 | 85 | 69 | 301 | 5 | 9 | 33 | 147 |
| Ameerega hahneli | 11 | 10 | 24 | 14 | 29 | 4 | 12 | 13 | 38 |
| Ameerega trivittata | 15 | 19 | 42 | 19 | 47 | 7 | 13 | 15 | 36 |
| Anomaloglossus baeobatrachus | 19 | 26 | 81 | 26 | 82 | 1 | 4 | 1 | 4 |
| Atelopus flavescens | 2 | 3 | 3 | 3 | 3 | 4 | 15 | 4 | 18 |
| Dendrobates tinctorius | 15 | 4 | 5 | 4 | 49 | 2 | 2 | 16 | 34 |
| Dendropsophus minutus | 33 | 1 | 1 | 138 | 394 | 147 | 392 | 1 | 1 |
| Engystomops petersi | 23 | 30 | 123 | 30 | 123 | 1 | 1 | 0 | 0 |
| Leptodactylus fuscus | 8 | 23 | 25 | 23 | 47 | 1 | 4 | 1 | 1 |
| Leptodactylus mystaceus | 18 | 22 | 55 | 22 | 48 | 1 | 1 | 1 | 2 |
| Osteocephalus buckleyi | 14 | 16 | 34 | 10 | 24 | 12 | 19 | 3 | 5 |
| Osteocephalus taurinus | 33 | 54 | 80 | 58 | 111 | 14 | 30 | 18 | 19 |
| Pristimantis zeuctotylus | 20 | 23 | 59 | 20 | 48 | 0 | 0 | 0 | 0 |
| Ranitomeya imitator | 5 | 9 | 17 | 9 | 18 | 9 | 15 | 9 | 15 |
| Ranitomeya variabilis | 18 | 8 | 15 | 18 | 34 | 0 | 0 | 19 | 30 |
| Ranitomeya ventrimaculata | 20 | 19 | 44 | 19 | 44 | 14 | 24 | 19 | 36 |
| Rhinella marina | 13 | 5 | 15 | 15 | 60 | 15 | 65 | 15 | 17 |
| Scinax ruber | 13 | 22 | 22 | 28 | 28 | 0 | 0 | 18 | 18 |

Figure 2. L functions showing that 453 sample sites from within Amazonia (cf. Fig. 1) are significantly clustered in geographic space. The observed function (bold grey line) runs above the confidence envelopes (hatched thin lines) of the expected function, derived from randomly distributed points (continuous thin black line).


Discussion 

Biogeographic-taxonomic gap 

Sampling efforts in terms of both the number of sample sites per species and the number of samples in total varied considerably, with a high proportion of species represented by only one or a few samples or sample sites. Taken at face value, these data suggest that Amazonian anurans are gravely underrepresented in GenBank. This is also supported by the 164 species names for which no GenBank entries were available (Supplementary data 4). However, sampling efforts have to be regarded in a geographic context. For instance, about one sixth of the species examined here are suggested to occupy geographic ranges of < 500 km² (Supplementary data 2). That is, a taxon is not necessarily underrepresented when only a small number of samples are available (Fig. 4A). Local and regional spatial range restrictions (micro-endemism) are a common phenomenon in certain Amazonian amphibian groups, such as dendrobatoid frogs (Lötters et al. 2007, Brown et al. 2011).

Figures 2 and 3 illustrate that there is a clear spatial sampling gap, with major portions of Amazonia having remained as yet unsampled, a phenomenon described as ‘missing areas’ by Sanmartín & Ronquist (2002) in the context of area cladograms. ‘Missing areas’ might especially be responsible for the underrepresentation of species with large geographic ranges (Figs 4B, C). However, this has to be regarded with particular care. Often, widespread Amazonian anurans turn out to represent complexes of cryptic allopatric taxa when taxonomically studied using molecular genetics on the basis of broad sampling (e.g., Fouquet et al. 2007a,b, 2012, 2014, 2016, Brown et al. 2011, Jungfer et al. 2013, Peloso et al. 2014, Gehara et al. 2014, Ferrão et al. 2016). Owing to their smaller distributions, these allopatric taxa are then comparatively better sampled. Hence, the assessment of how well a taxon is represented in GenBank is hampered in widespread species pending taxonomic clarification. A prime example is the poorly sampled Atelopus spumarius. It seems to have a relatively large geographic range across the Amazon Basin (Fig. 4B), but at the same time is suggested to represent a complex of various taxa based on bioacoustics, osteology, and larval and adult morphology (Lötters et al. 2002). On the other hand, some species (or widespread lineages within them) have been demonstrated to indeed occupy large geographic ranges, such as Adenomera andreae, Ameerega trivittata, Boana boans, B. calcarata, Chiasmocleis avilapiresae, C. bassleri, Lithobates palmipes, Osteocephalus taurinus, or Pipa pipa (Roberts et al. 2006, Fouquet et al. 2007a, 2014, Angulo & Icochea 2010, Funk et al. 2011, Peloso et al. 2014). An intriguing observation is that some of these are amongst the best-sampled species (Fig. 4D; Table 2). Moreover, some species that might be truly widespread, such as Ceratophrys cornuta (Lynch 1982, Duellman 2005), are nevertheless underrepresented in GenBank (Fig. 4C).

We conclude that a large knowledge gap exists for many Amazonian anuran species that are underrepresented in GenBank. This is due not only to spatial sampling, but also to taxonomic deficiencies. It is not our goal to allocate particular species to certain categories of representativeness here (to avoid the definition of artificial limits); however, the following general patterns might apply:

(A) Species with local to regional distributions that are (a) taxonomically well understood and relatively well represented in GenBank (Fig. 4A); (b) taxonomically little understood and poorly represented in, or absent from, GenBank (Fig. 4A); (c) unknown but expected, especially when endemic to ‘missing areas’.

(B) Unconfirmed widespread species that might mask complexes of unidentified cryptic taxa that are poorly represented in GenBank (Fig. 4B).

(C) Species that are confirmed to be truly widespread and (a) are poorly represented in GenBank (Fig. 4C); or (b) are adequately represented in GenBank (Fig. 4D).



Figure 3. Human Footprint Index (HFI) values across Amazonia (from ‘mostly wild’ to ‘high anthropogenic impacts’, i.e., dark green to bright red). GenBank sample sites are indicated by black dots.

Considerations on GenBank data

As may perhaps be expected, sampling effort was highest for 16S. In the past, the mt 16S rRNA gene was suggested as a universal standard DNA barcoding marker in amphibians (Vences et al. 2005) and was therefore favoured over COI in many studies. However, in more recent years, technical problems have been solved by the development of degenerate universal COI primers, and COI is on its way to ‘overtake’ 16S (Che et al. 2012, Peloso et al. 2014). This may already be reflected in our results for Amazonian anurans, as COI accounts for high numbers of samples and sample sites in some species, the most prominent example being the Dendropsophus minutus species complex, which recently was the subject of a comprehensive molecular study by Gehara et al. (2014).

There is no control mechanism for species names allocated to samples deposited in GenBank, and names are not updated according to ongoing taxonomic changes. This problem has repeatedly been pointed out before and is not particular to anurans (Harris 2003, Shen et al. 2013). However, it is markedly relevant here, given the progress in Amazonian anuran taxonomy. It is highly probable that a number of Amazonian anuran samples in GenBank are not pooled under a valid name. This produces potential conflicts when adopting names from GenBank, as has been done in this study.
Figure 4. Geographic ranges and GenBank sample sites (red dots) of (A) two Peruvian species (Hyloxalus azureiventris, H. patitae (arrow)) with locally restricted distributions, of which one is comparatively well and the other not represented in GenBank; (B) Atelopus spumarius, which occupies a relatively large geographic range, but is poorly sampled and likely represents a complex of distinct taxa; (C) Ceratophrys cornuta, an apparently truly widespread, but poorly sampled species; (D) the apparently truly widespread Adenomera andreae, which is among the best sampled of all Amazonian anuran species in GenBank. Distribution polygons were adopted from the IUCN Red List of Threatened Species; for details on sample sites see Supplementary data 4-5.